For Title I evaluations, it may be appropriate to test out of
level; that is, to override the publisher's recommendations
concerning the difficulty, length, and content appropriate for a
particular grade. It is seldom necessary, however, to move more than
one grade down. If the mean is substantially higher than the median,
then some pupils will have encountered the floor of a test and an
easier level of test should have been chosen. The ceiling of most
tests becomes a handicap when three-quarters of a group can answer
the most difficult items correctly. In this case, the mean is
substantially lower than the median and a more difficult test should
have been chosen. In general, the level of a test is suitable when
the mean raw score of the group is equal to or above a third of the
maximum score, and somewhat less than three-quarters of the maximum.
In norm-referenced evaluations, out-of-level testing is possible with
most standardized achievement tests because they provide tables for
relating raw scores on out-of-level tests to in-level percentile
norms. (CP)
number of items to have at each
[...] difficulty, the length of time the tests
[...] and the suitability of the content
[...] group at which the tests will be
[...] Inevitably some of these goals conflict
[...] other so that test design becomes a
[...] compromises.


For example, a test of reading ability could
be designed that would take only an hour to give,
and would be usable at all grade levels. That
would mean that first-grade students would be
profitably occupied for about five minutes and
frustrated the rest of the time, while twelfth-
grade students would be overcome with boredom.
Or again, a first grader could take some pride
in reading "The cat sat on the mat" while a
fourth-grade student, even though a very poor
reader, might feel insulted by the choice of
content.
[...] these problems, tests are usually
[...] with different levels, each level
[...] suitable in terms of both content and
[...] children of specific ages or in
[...]. The wider the age/grade band
[...] particular level, the more likely it
[...] the difficulty or the content (or
[...] poorly matched to pupils at the upper
[...] ends of the distribution. On the
[...] focusing test levels on too narrow
[...] problems, including that of simply being
REASONS FOR OUT-OF-LEVEL TESTING


The publishers of the major achievement tests
have highly qualified personnel, up-to-date
techniques, and years of experience developing and
scaling tests. While it would be hard to improve
upon the compromises they have made with respect
to test level difficulty, length, and content,
special conditions may sometimes make it desirable
to override the publisher's recommendations as to
which level of a test should be used at a particular
grade. This circumstance is likely to arise
in Title I settings when students with the poorest
performances are tested. It should be noted that
a lower level test will not be needed for every
Title I group. A look at some of the factors that
are considered by test designers might be useful
when trying to decide whether to test out of level.


No test measures exhaustively; it samples
skills or abilities and it does so for the best
and the poorest students simultaneously. Data
from samples, as in the case of opinion polls, can
yield quite accurate predictions, but the
approximations are poorer when samples are smaller. Thus
we need to ensure that the proportion of test
material on which students can profitably spend time
and effort does not drop too low--as it would if
the test were either much too difficult or far too
easy.




Floor Effects 


What is the optimal proportion? To answer
this question first consider multiple-choice
instruments; they have enough advantages to make
them the best choice for standardized tests, but
they do have some unavoidable disadvantages. One
of these is that even when a group of students
is quite out of its depth, erroneous thought
processes or guessing can yield apparently
"interpretable" scores. For example, if 100 students
completed a 32-item, four-choice test by guessing
alone, the average score would be about eight,
and the scores would range from a low of about
three to a high of around thirteen. This range
that occurs as a result of guessing is a serious
problem because it severely reduces the
reliability of a test.
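
The guessing figures in this example follow from the binomial
distribution. The short Python sketch below reproduces them. The
function names are illustrative, and reading the quoted range as
roughly two standard deviations either side of the chance score is
an assumption, not something the text states.

```python
import math

def chance_score(n_items, n_choices):
    """Expected raw score from pure guessing: items / alternatives."""
    return n_items / n_choices

def guessing_sd(n_items, n_choices):
    """Binomial standard deviation of raw scores under pure guessing."""
    p = 1 / n_choices
    return math.sqrt(n_items * p * (1 - p))

mean = chance_score(32, 4)
sd = guessing_sd(32, 4)
# Assumed reading of the quoted range: about two SDs either side of the mean.
low, high = mean - 2 * sd, mean + 2 * sd
print(f"chance score {mean:.0f}, SD {sd:.2f}, typical range {low:.0f} to {high:.0f}")
```

For the 32-item, four-choice test this gives a chance score of 8 and
a standard deviation of about 2.45, so the band runs from roughly 3
to 13, in line with the figures quoted.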


From the example above, it can be seen that 
it is possible to encounter the "floor" of a test 
even though no students have scored zero. Long 
before the average raw score of a group is at 
chance level (the score equal to the total number 
of questions divided by the number of alternative 
answers to each question), some pupils in the 
group will have encountered the floor and an
easier level of test should have been chosen.


If you suspect that you may have encountered 
a floor effect, a convenient check is to compare 
the mean with the median of the scores. If the
mean is substantially higher than the median 
(by about a third of a standard deviation) then 
your suspicions are very likely confirmed. 


Ceiling Effects


Unfortunately, there is also the other
extreme--too easy a test. If we seek to avoid
possible "floor" effects by choosing a lower level
of test, we could bump up against the "ceiling."
Once again this can occur even if no one achieves
the highest possible raw score since carelessness
and accident are more likely to occur if the test
items are too easy. It is more difficult to find
a way of setting limits here, but, in practice,
the ceiling of most tests becomes a serious
handicap when three-quarters of a group can answer the
most difficult items correctly.


If you suspect that you may have encountered
a ceiling effect, a convenient check is to compare
the mean with the median of the scores. If the
mean is substantially lower than the median (by
about a third of a standard deviation), then your
suspicions are very likely confirmed.
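
The mean-versus-median checks for floor and ceiling effects can be
sketched in a few lines of Python. This is only an illustration: the
function name, the sample scores, and the use of the population
standard deviation are assumptions, and the one-third threshold is
the rough guide given in the text rather than a strict cutoff.

```python
from statistics import mean, median, pstdev

def floor_or_ceiling(scores, threshold=1 / 3):
    """Compare mean and median in standard-deviation units.

    Returns "floor" if the mean exceeds the median by at least
    `threshold` SDs, "ceiling" if it falls short by that much,
    and "none" otherwise.
    """
    sd = pstdev(scores)
    if sd == 0:
        return "none"
    gap = (mean(scores) - median(scores)) / sd
    if gap >= threshold:
        return "floor"
    if gap <= -threshold:
        return "ceiling"
    return "none"

# Hypothetical raw scores on a 32-item test.
bunched_low = [8, 8, 8, 9, 9, 9, 10, 10, 25, 30]
bunched_high = [24, 24, 24, 23, 23, 23, 22, 22, 7, 2]
print(floor_or_ceiling(bunched_low), floor_or_ceiling(bunched_high))
```

In the first hypothetical group most scores sit near chance level
with a few high scores pulling the mean up, which flags a floor; the
second group is its mirror image and flags a ceiling.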


DETERMINING THE APPROPRIATE LEVEL
In most instances the level of a test is
suitable when the mean raw score of the group is equal
to or above a third of the maximum score, and
somewhat less than three-quarters of the maximum.
The highest reliability of a test is achieved when
the students, on the average, get slightly more
than half the items correct. However, unless
previous test scores are available as guidance, one
has to depend upon teaching experience and
judgment to select the correct test levels. It should
seldom, if ever, be necessary to move more than
one level down; and even that is likely to be
unnecessary when, for example, the group comes from
grade 4 and the test is suitable for grades 3
and 4.
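
The rule of thumb for a suitable level can be written as a simple
range check. The Python sketch below is illustrative only: the
function name is hypothetical, and treating "somewhat less than
three-quarters" as a strict upper bound is an assumption.

```python
def level_is_suitable(mean_raw_score, max_score):
    """True when the group mean is at least one third of the maximum
    score and below three-quarters of it."""
    lower = max_score / 3
    upper = 0.75 * max_score
    return lower <= mean_raw_score < upper

# On a 32-item test the acceptable band is roughly 10.7 up to 24.
print(level_is_suitable(16, 32))   # mean at half the items
print(level_is_suitable(8, 32))    # mean at chance level on a four-choice test
```

A group mean of 16 on a 32-item test falls inside the band, while a
mean of 8 (the chance score for four-choice items) falls below it.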


Test publishers try to avoid the occurrence of
ceiling and floor effects and to construct their
tests so that the median score at the appropriate
grade level is well above half the number of items
in the test. Thus, for an average class, students
are more likely to score close to the ceiling of
the test than close to its floor. If the same
test is used at a higher grade level, the trend
is increased.


It can be seen that, if too low a test level
is used, scores will be artificially depressed.
If this occurs on the posttest and not on the
pretest, gains will also be depressed. If the test
ceiling is encountered only on the pretest and not
on the posttest (because, presumably, the level
was changed), gains will be spuriously inflated.
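
A small numerical illustration may help here. The Python sketch below
uses made-up scores, assumed to lie on a common scale, and models the
ceiling as a hard cap. That is a simplification, since the text notes
that ceiling effects can set in before anyone reaches the maximum
score.

```python
from statistics import mean

def observed_scores(true_scores, ceiling):
    """Scores as recorded when nothing above `ceiling` can register."""
    return [min(score, ceiling) for score in true_scores]

# Case 1: ceiling met on the posttest only, so the gain is depressed.
pre = [20, 24, 28, 30]
post = [26, 30, 34, 36]            # every pupil improves by 6
true_gain = mean(post) - mean(pre)
depressed_gain = mean(observed_scores(post, 32)) - mean(pre)

# Case 2: ceiling met on the pretest only, so the gain is inflated.
pre2 = [28, 32, 36, 40]
post2 = [30, 34, 38, 42]           # every pupil improves by 2
true_gain2 = mean(post2) - mean(pre2)
inflated_gain = mean(post2) - mean(observed_scores(pre2, 32))

print(true_gain, depressed_gain, true_gain2, inflated_gain)
```

In the first hypothetical case a real gain of 6 points shows up as
4.5; in the second a real gain of 2 points shows up as 5.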
It is never proper to do out-of-level testing
simply to give pupils the experience of success--
especially when the practice could result in
encountering test ceilings. On the other hand if,
as in the Stanford Achievement Tests, many of the
levels are intended for use at the end of one
grade and at the beginning of the next only, it
would be quite reasonable to use the lower level
for both pre- and posttest in a fall-spring
design, even where the fall testing was in-level
and the spring is out-of-level.


INTERPRETING SCORES FROM OUT-OF-LEVEL
TESTING


In norm-referenced evaluations, out-of-level
testing is possible only with tests that provide
an expanded standard score scale. This scale
allows the raw scores on the out-of-level test
to be related to the in-level percentile norms.
Most of the major standardized achievement tests
presently have this type of scale, but the
conversions which must be made will depend upon
whether the publisher has provided raw-score-to-
percentile or standard-score-to-percentile
conversion tables. In either case, the goal is to
determine the percentile rank (or NCE) that would,
in theory, have been obtained if the appropriate
level of test had been used.
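
For readers unfamiliar with the NCE (normal curve equivalent), the
Python sketch below shows the usual relationship between a percentile
rank and an NCE: a normal-curve scale with a mean of 50 and a standard
deviation of about 21.06, chosen so that NCEs of 1, 50, and 99 coincide
with those percentiles. The function name is illustrative, and in
practice the publisher's own tables should be used.

```python
from statistics import NormalDist

def percentile_to_nce(percentile):
    """Convert a percentile rank (between 0 and 100, exclusive)
    to a normal curve equivalent."""
    z = NormalDist().inv_cdf(percentile / 100)
    return 50 + 21.06 * z

for pct in (1, 28, 39, 50, 99):
    print(pct, round(percentile_to_nce(pct), 1))
```

Percentiles and NCEs agree at 1, 50, and 99 but differ in between;
a percentile rank of 39, for instance, corresponds to an NCE of
about 44.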


Tests with Expanded-Standard-Score-to-Percentile 
Conversion Tables 


Some achievement tests convert raw scores to
expanded standard scores for each test level, and
then provide a separate table converting the
expanded standard scores to percentiles for each
grade and time of year. Tests requiring these
conversions include the Iowa Test of Basic Skills,
the Metropolitan Achievement Test, the Sequential
Tests of Educational Progress II, and the SRA
Achievement Series. (Note that the expanded
standard score scales may have different names in
different tests, e.g., "standard scores," "scale
scores," or "converted scores." The name does not
always indicate whether the scores are expanded to
cover different grade levels. For this information,
refer to the RNC Technical Paper No. 5, entitled
"Characteristics of Eight Commonly Used,
Nationally Normed Tests," or to the test
publishers' manuals.)


For tests that convert expanded standard
scores to percentiles, use the following
procedure: convert each student's raw score on the
level of the test which was administered to its
corresponding expanded standard score and compute
the average. Then, convert the average expanded
standard score to a percentile or NCE using the
tables for the "appropriate" test level.


Suppose we have a group in grade 4 for which
the Green Level reading test is nominally
recommended. Instead we used the Blue Level which is
one level lower. Assume that the testing was done
in the fall at a time which corresponded to an
empirical normative data point. To obtain the
appropriate percentile (or NCE) value, we should:


1. Convert each student’s raw score to an ex- 
panded standard score using the table for the 
Blue Level. 


2. Find the average of these standard scores. 
(Assume it was 64.) 


3. In the manual for the Green Level, use the 
standard-score-to-percentile conversion table 
for beginning of 4th grade and find the per- 
centile (or NCE) that corresponds to the stan- 
dard score. In this example a standard score 
of 64 might correspond to the percentile rank 
of 39. 
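

The three steps above can be sketched as a pair of table lookups in
Python. The table excerpts and raw scores below are hypothetical
stand-ins for the publisher's tables; only the mean standard score of
64 and the percentile rank of 39 come from the example. Rounding the
mean to the nearest tabled standard score is also an assumption.

```python
from statistics import mean

# Hypothetical excerpt: Blue Level raw score -> expanded standard score.
BLUE_RAW_TO_STANDARD = {20: 60, 24: 64, 28: 68}
# Hypothetical excerpt: Green Level, beginning of grade 4,
# expanded standard score -> percentile rank.
GREEN_FALL_G4_STANDARD_TO_PERCENTILE = {60: 31, 64: 39, 68: 47}

def out_of_level_percentile(raw_scores, raw_to_standard, standard_to_percentile):
    # Step 1: convert each raw score using the administered level's table.
    standard_scores = [raw_to_standard[raw] for raw in raw_scores]
    # Step 2: average the expanded standard scores.
    mean_standard = round(mean(standard_scores))
    # Step 3: look up the mean in the in-level percentile table.
    return standard_to_percentile[mean_standard]

group_percentile = out_of_level_percentile(
    [20, 24, 28], BLUE_RAW_TO_STANDARD, GREEN_FALL_G4_STANDARD_TO_PERCENTILE
)
print(group_percentile)
```

With these made-up tables the three pupils average a standard score
of 64, which the Green Level table maps to the 39th percentile.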


Tests with Raw-Score-to-Percentile Conversion Tables


Instead of the conversion tables just
discussed, some tests provide tables that convert
raw scores to expanded standard scores, and raw
scores to percentile ranks. These tests include
the California Achievement Tests, the
Comprehensive Tests of Basic Skills, and the Stanford
Achievement Test. For such tests, we would
convert the raw scores for the level of the test
which was administered to expanded standard
scores and find their mean. This value we would
then take to the tables for the "appropriate"
level, and find the corresponding in-level raw
score. Finally, in the appropriate table for
this higher level, we would find the percentile
rank (or NCE) for that raw score.


Suppose, for example, that we posttested a 
grade 7 group in spring using the Orange Level 
of a reading test when the Red Level was the rec- 
ommended one. To find the appropriate percentile 
(or NCE) we should: 


1. Convert the raw score of each student to the
corresponding expanded standard score, using
the Orange Level raw-score-to-expanded-standard-score
conversion table.


2. Find the average of these standard scores.
(Assume it was 423.)


3. In the manual for the Red Level, use the raw-
score-to-standard-score conversion table to
determine the in-level raw score corresponding
to the mean standard score. Our value of 423
might correspond to a raw score of 33.


4. Finally, in the same manual (Red Level), using
the end-of-7th-grade tables, convert this raw
score to a percentile (or NCE). In our
example, the raw score of 33 might have a
corresponding percentile value of 28.
It is easy to see that this process is similar
to the one used for tests that convert expanded 
standard scores to percentiles, but one extra step 
is needed to go from the expanded standard score 
to the appropriate in-level raw score. 
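
The extra step can be sketched in Python as an inverse lookup in the
in-level raw-score-to-standard-score table. The table excerpts and
raw scores below are hypothetical; only the values 423, 33, and 28
come from the worked example. Choosing the in-level raw score whose
standard score lies nearest the mean is an assumption about how to
handle means that fall between table entries.

```python
from statistics import mean

# Hypothetical excerpts from the publisher's tables.
ORANGE_RAW_TO_STANDARD = {30: 415, 32: 423, 34: 431}
RED_RAW_TO_STANDARD = {31: 410, 32: 417, 33: 423, 34: 430}
RED_SPRING_G7_RAW_TO_PERCENTILE = {31: 22, 32: 25, 33: 28, 34: 31}

def in_level_raw_score(mean_standard, raw_to_standard):
    """Inverse lookup: the raw score whose standard score is nearest."""
    return min(raw_to_standard,
               key=lambda raw: abs(raw_to_standard[raw] - mean_standard))

def percentile_via_raw_score(raw_scores, admin_raw_to_standard,
                             in_level_raw_to_standard,
                             in_level_raw_to_percentile):
    # Steps 1 and 2: convert with the administered level's table, then average.
    mean_standard = mean(admin_raw_to_standard[raw] for raw in raw_scores)
    # Step 3: find the in-level raw score for that mean standard score.
    raw = in_level_raw_score(mean_standard, in_level_raw_to_standard)
    # Step 4: convert the in-level raw score to a percentile.
    return in_level_raw_to_percentile[raw]

group_percentile = percentile_via_raw_score(
    [30, 32, 34], ORANGE_RAW_TO_STANDARD,
    RED_RAW_TO_STANDARD, RED_SPRING_G7_RAW_TO_PERCENTILE
)
print(group_percentile)
```

With these made-up tables the mean standard score of 423 maps to a
Red Level raw score of 33 and then to the 28th percentile.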


